Abstract

The estimation of asset return distributions is crucial for determining optimal trading
strategies. In this paper we describe the constrained mixture model, based on a mixture
of Gamma and Gaussian distributions, to provide an accurate description of price trends
as being clearly positive, negative or ranging while accounting for heavy tails and high
kurtosis. The model is estimated in the Expectation Maximisation framework and model
order estimation also respects the model's constraints.

1 INTRODUCTION



The estimation of asset return distributions is crucial for determining optimal trading 
strategies. One convenient estimation approach selects a distribution model and esti- 
mates its parameters. The advantage of this approach is the ease with which probability 
distributions can be calibrated and applied in post-processing. The disadvantage of as- 
suming a particular parametric distribution is that inferences and decisions depend crit- 
ically on the choice of distribution. For example, asset returns frequently feature large 
"outlying" values, making distributions with light tails inapplicable. 

Semi-parametric methods attempt to capture the advantages but not the disadvan- 
tages of a parametric specification of a returns distribution by using a more flexible 
functional form. Most prominent among the semi-parametric distributions are mixtures 
of distributions. They provide a flexible specification and, under certain conditions, can 
approximate distributions of any form. 

2 MIXTURE MODELS AND EXTENSIONS 
2.1 Classical Mixture Models 

A standard mixture probability density of a random variable X, whose value is denoted 
by x, is defined as 

$$p_X(x; \nu) = \sum_{k=1}^{K} \pi_k \, p_X(x; \theta_k). \qquad (1)$$

The mixture density has K components (or states) and is defined by the parameter set
$\nu = \{\theta, \pi\}$, where $\pi = \{\pi_1, \dots, \pi_K\}$ is the set of weights given to each component
and $\theta = \{\theta_1, \dots, \theta_K\}$ is the set of parameters describing each component distribution.

By far, the most popular mixture model is the Gaussian mixture model (GMM). It is 
given as 

$$p_X(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \sigma_k^2), \qquad (2)$$

where each component parameter vector $\theta_k$ now consists of the mean and variance
parameters, $\mu_k$ and $\sigma_k^2$ respectively (see Appendix A for the definition of the probability
distributions).
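
To make equation (2) concrete, the following is a minimal sketch of how a Gaussian mixture density could be evaluated. The function names and the three-component parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def normal_pdf(x, mu, var):
    """Gaussian density N(x; mu, var), parameterised by mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density of equation (2): sum_k pi_k * N(x; mu_k, sigma_k^2)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical three-component mixture (illustrative values only).
weights = [0.5, 0.3, 0.2]
means = [0.0, 2.0, 5.0]
variances = [1.0, 0.5, 2.0]
density_at_one = gmm_pdf(1.0, weights, means, variances)
```

Because the weights sum to one and each component integrates to one, the mixture is itself a proper density.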

The Gaussian mixture distribution can be, and has been, estimated in the Maximum
Likelihood or in a Bayesian framework (see [?] for both estimation methods). The
Gaussian mixture distribution is often referred to as a universal approximator [?], an
indication of the fact that it can approximate distributions of any form. Figure 1,
for example, shows a 3 component GMM approximating a sample with the histogram
shown in the top plot.

The number of components needed to model the data depends very much on the
problem at hand. In some sense, it is the discrepancy between the data distribution and
the mixture model that determines the number of components (also known as the model
order). Data distributions with heavy tails require two or more light tailed components to
compensate. In Figure 1, for example, the data was drawn from a single Gamma distribution
yet three Gaussian components were needed to capture most aspects of the Gamma
distribution.

More components require larger sample sizes to ensure adequate calibration. In the
extreme case there may be insufficient data available to calibrate a given mixture model
with a certain degree of accuracy. In short, while Gaussian mixture models are very
flexible they may not be the most appropriate model. If more is known about the data
distribution, such as its behaviour in the tails, incorporating this knowledge can only
help improve the model.

Figure 1: Histogram of a sample drawn randomly from a Gamma distribution and the
estimated Gaussian mixture model.



2.2 Gamma Mixture Models 

The Gamma mixture distribution is another commonly used model. It is used when the
data values are strictly positive. Another reason for its use is that Gamma densities
exhibit much heavier tails than Gaussian densities. Thus, events that deviate from the
mean by several standard deviations are much more probable than under a Gaussian
model assumption. As a consequence, large return values are not underestimated under
the Gamma mixture assumption.
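
The tail argument can be checked numerically. The sketch below compares the probability of exceeding the mean by four standard deviations under a Gamma and under a Gaussian distribution. It assumes an integer shape parameter, so that the Gamma upper tail has the closed (Erlang) form; the shape and rate values are hypothetical.

```python
import math


def gaussian_upper_tail(z):
    """P(Z > z) for a standard Gaussian variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def erlang_upper_tail(x, shape, rate):
    """P(X > x) for a Gamma variable with integer shape (the Erlang case)."""
    bx = rate * x
    return math.exp(-bx) * sum(bx ** n / math.factorial(n) for n in range(shape))


# Hypothetical Gamma distribution: shape 4, rate 1 (mean 4, standard deviation 2).
shape, rate = 4, 1.0
mean = shape / rate
std = math.sqrt(shape) / rate
threshold = mean + 4.0 * std

gamma_tail = erlang_upper_tail(threshold, shape, rate)
gauss_tail = gaussian_upper_tail(4.0)
ratio = gamma_tail / gauss_tail
```

For these values the Gamma tail probability is more than an order of magnitude larger than the Gaussian one, which illustrates why light-tailed components underestimate large returns.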

The Gamma mixture model (GaMM) is given as 

$$p_X(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{G}a(x; \alpha_k, \beta_k), \qquad (3)$$

where each component parameter vector $\theta_k$ now consists of the parameters shape and
precision (inverse scale or rate), denoted respectively by $\alpha_k$ and $\beta_k$ (see Appendix A for
notation).

The Gamma mixture distribution can be estimated via the Maximum Likelihood [2]
or the Bayesian framework [?]. Similar to its Gaussian counterpart, the Gamma mixture
distribution can approximate any distribution on $\mathbb{R}^+$.

Note that, for Bayesian inference, there is no natural prior for the shape parameter
of the Gamma distribution. Priors can be specified but require full MCMC (instead of
Gibbs) sampling methods for estimation. With regard to maximum likelihood estimation,
note also that there is no closed form solution for the maximum likelihood estimator of
the shape parameter, unless approximation assumptions are made [2], which then
permit the use of gradient descent optimisation [?]. Practice has shown, however, that
even when making only small adjustments to the parameters the estimates frequently
violate the positivity constraints, most notably that of the shape parameter.

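Such limitations can be avoided, however, via the unique mapping that exists from
the density's mean $\mu$ and variance $\sigma^2$ to its shape and rate parameters,

$$\alpha = \frac{\mu^2}{\sigma^2}, \qquad \beta = \frac{\mu}{\sigma^2}. \qquad (4)$$

This follows from the Gamma mean $\alpha/\beta$ and variance $\alpha/\beta^2$. Thus, through the
estimates resulting from the closed form solution of the mean and variance, the shape
and rate parameters can be uniquely determined.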
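
As a small illustration of this moment mapping, the sketch below converts a mean and variance into a shape and rate, and back again. The function names are hypothetical, and the rate (inverse scale) parameterisation of Appendix A is assumed.

```python
def gamma_from_moments(mean, variance):
    """Map a positive mean and variance to the Gamma shape and rate."""
    if mean <= 0.0 or variance <= 0.0:
        raise ValueError("mean and variance must both be positive")
    shape = mean ** 2 / variance
    rate = mean / variance
    return shape, rate


def gamma_moments(shape, rate):
    """Inverse mapping: mean = shape / rate, variance = shape / rate^2."""
    return shape / rate, shape / rate ** 2


# Round trip for a hypothetical component with shape 20 and rate 3.
mean, variance = gamma_moments(20.0, 3.0)
shape, rate = gamma_from_moments(mean, variance)
```

Since a positive mean and variance always yield a positive shape and rate, this route cannot violate the positivity constraints mentioned above.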



3 Constrained Mixture Models 

Financial asset returns feature long positive and negative tails. In addition there is a 
large concentration of values around the origin. Modelling this constellation of distribu- 
tions can be achieved by means of a Gaussian mixture model. However, as we pointed 
out earlier, heavy tail behaviour is more parsimoniously modelled with Gamma distribu- 
tions. This fact leads to the obvious attempt to model large negative and positive values 
by Gamma distributions while a mixture of Gaussian densities takes on the task of
modelling the sharply peaked distribution near the origin. This model is hereafter referred
to as the constrained mixture model (CMM) or the Gauss-Gamma mixture distribution. 

3.1 Constraining by a Gauss-Gamma Mixture Distribution 

The main difference to the standard mixture model is the association of subsets of the
components $k = 1, \dots, K$ with only positive and only negative valued observations. We will
use the shorthand notation $k^{\oplus}$ for the set of mixture component indices associated with
positive observations. Likewise, $k^{\ominus}$ refers to the set of mixture component indices
responsible for all negative valued observations. To specify the remaining set of component
indices we use the symbol $k^{0}$, i.e. $k^{0} = \{1, \dots, K\} \setminus \{k^{\oplus} \cup k^{\ominus}\}$. For example, a $K = 5$
component mixture model may be split into two components for positive valued
observations, $k^{\oplus} = \{1, 2\}$, and one component for negative valued observations, $k^{\ominus} = \{5\}$,
whilst the remaining components are $k^{0} = \{3, 4\}$ and apply to all observations.

The mixture component distributions are chosen according to which domain they 
are responsible for. We define three groups of mixture components as follows (see 
Appendix A for notation):

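Near Zero Domain: Observations with values around zero are modelled by a set of
Gaussian distributions which are all restricted to have zero mean. The probability
of x is thus

$$p_X(x; \theta_k) = \mathcal{N}(x; \mu_k = 0, \sigma_k^2) \quad \forall k \in k^{0}. \qquad (5)$$

Positive Domain: Observations with positive values are modelled by a set of Gamma
distributions. The probability of x is thus

$$p_X(x \mid \theta_k) = \mathcal{G}a(x; \alpha_k, \beta_k) \quad \forall k \in k^{\oplus} \qquad (6)$$

if the value x of X is in $\mathbb{R}^+$, and zero otherwise.

Negative Domain: Observations with negative values are modelled by a set of Gamma
distributions, and so the probability of x is

$$p_X(x \mid \theta_k) = \mathcal{G}a(-x; \alpha_k, \beta_k) \quad \forall k \in k^{\ominus} \qquad (7)$$

if the value x of X is in $\mathbb{R}^-$, and zero otherwise.

Thus, the full constrained mixture model is given as

$$p_X(x) = \begin{cases} \sum_{k \in k^{\oplus}} \pi_k \, \mathcal{G}a(x; \alpha_k, \beta_k) + \sum_{l \in k^{0}} \pi_l \, \mathcal{N}(x; 0, \sigma_l^2) & \text{if } x \in \mathbb{R}^+ \\ \sum_{k \in k^{\ominus}} \pi_k \, \mathcal{G}a(-x; \alpha_k, \beta_k) + \sum_{l \in k^{0}} \pi_l \, \mathcal{N}(x; 0, \sigma_l^2) & \text{if } x \in \mathbb{R}^- \end{cases} \qquad (8)$$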

Note that negative values are modelled by a Gamma distribution with sign-reversed
argument. In our notation, k takes any of the index values of the states associated with
the constrained domain. A further consequence of our notation is that the parameter set
$\theta$ consists of subsets $\theta_k$, each of which also holds the parameters of all other domains,
e.g. $\theta_3 = \{\mu_3, \sigma_3^2; \alpha_3, \beta_3\}$. The reason for this is that we will be using the means
and variances of the Gamma distributions to compute the distributions' shape and rate
parameters according to equation (4).
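
The piecewise density of equation (8) can be sketched in a few lines. The helper names, the tuple layout of the component lists and all parameter values below are assumptions made for illustration; the Gamma density uses the rate parameterisation of Appendix A.

```python
import math


def zero_mean_normal_pdf(x, var):
    """Gaussian density N(x; 0, var)."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gamma_pdf(x, shape, rate):
    """Gamma density Ga(x; shape, rate); zero outside the positive half-line."""
    if x <= 0.0:
        return 0.0
    log_p = (shape * math.log(rate) - math.lgamma(shape)
             + (shape - 1.0) * math.log(x) - rate * x)
    return math.exp(log_p)


def cmm_pdf(x, pos, neg, zero):
    """Equation (8). pos and neg hold (weight, shape, rate); zero holds (weight, variance)."""
    density = sum(w * zero_mean_normal_pdf(x, v) for w, v in zero)
    if x > 0.0:
        density += sum(w * gamma_pdf(x, a, b) for w, a, b in pos)
    elif x < 0.0:
        # Negative observations use the Gamma density with sign-reversed argument.
        density += sum(w * gamma_pdf(-x, a, b) for w, a, b in neg)
    return density


# Hypothetical 2/1/2 configuration with equal weights (illustrative values only).
pos = [(0.2, 20.0, 3.0), (0.2, 10.0, 4.0)]
neg = [(0.2, 20.0, 3.0), (0.2, 10.0, 4.0)]
zero = [(0.2, 1.0)]
density_at_two = cmm_pdf(2.0, pos, neg, zero)
```

The Gaussian components contribute on both sides of the origin, while each Gamma component contributes on one side only, so the weights of all three groups together still sum to one.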

An example of a sample drawn from the constrained mixture and its continuous
density function are shown in Figure 2.

Figure 2: Histogram of a sample drawn randomly from a constrained mixture model and the
continuous density from which the sample was generated. 

3.2 Alternative Approaches to Constraining Distributions 

There are other ways to constrain the model. One way would be through the use of
rectified Gaussian distributions [?] [6]. However, the models in [?] use a cut-off function

$$\mathrm{cut}(x) = \max(x, 0), \qquad (9)$$

which places too much weight on zero. Also, the CMM is considerably simpler while
perfectly satisfying the required constraints.

4 Mixture Model Estimation 

To motivate the estimation procedure we need to expand the mixture model. In particular,
we introduce, for each datum, a latent indicator variable. This variable indicates
which of the mixture components is responsible for the datum in question. The (marginal)
probability that any indicator variable selects the k-th component is given by the weight
$\pi_k$ that is associated with the k-th mixture component.

4.1 Latent Indicator Variable Representation of Mixture Models

Let us first define the following one-dimensional observation set $X = \{X_1, \dots, X_t, \dots, X_T\}$
of length T, indexed by t. The set is assumed to be generated by a K-component
mixture model.

To indicate the mixture component from which a sample was drawn, we introduce
a latent random variable, $S_t$. The value of $S_t$, which we denote by $s_t$, is a vector of
length K. The components of the vector, $s_{tk}$, are either 0 or 1. We set the vector's k-th
component, $s_{tk} = 1$, to indicate that the k-th mixture component is selected, while all
other components are set to 0. As a consequence,

$$\sum_{k=1}^{K} s_{tk} = 1. \qquad (10)$$

We can now specify the joint probability distribution of X and S in terms of a
marginal distribution $p_{S_t}(s_t; \pi)$ and a conditional distribution $p_{X_t \mid S_t}(x_t \mid s_t; \theta)$ as

$$p_{X,S}(x, s; \nu) = \prod_{t=1}^{T} p_{X_t \mid S_t}(x_t \mid s_t; \theta) \, p_{S_t}(s_t; \pi), \qquad (11)$$

where the parameter vector is $\nu = \{\theta, \pi\}$.

The marginal distribution $p_{S_t}(s_t; \pi)$ is a multinomial distribution that
is parameterised by the mixing weights $\pi = \{\pi_1, \dots, \pi_K\}$. Thus,

$$p_{S_t}(s_t; \pi) = \prod_{k=1}^{K} \pi_k^{s_{tk}}, \qquad (12)$$

or, more simply,

$$p(s_{tk} = 1) = \pi_k. \qquad (13)$$

Naturally the weights must satisfy $\pi_k \in [0, 1]$ and

$$1 = \sum_{k=1}^{K} \pi_k. \qquad (14)$$



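As for the conditional distribution, $p_{X_t \mid S_t}(x_t \mid s_t; \theta)$, its form depends on the value
of the latent variable $S_t$. For the constrained mixture model we have in particular

$$p_{X_t \mid S_t}(x_t \mid s_{tk} = 1; \theta_k) = \begin{cases} \mathcal{N}(x_t; 0, \sigma_k^2) & \forall x_t \text{ and } k \in k^{0} \\ \mathcal{G}a(x_t; \alpha_k, \beta_k) & x_t \in \mathbb{R}^+ \text{ and } k \in k^{\oplus} \\ \mathcal{G}a(-x_t; \alpha_k, \beta_k) & x_t \in \mathbb{R}^- \text{ and } k \in k^{\ominus} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$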



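The full model is thus defined as

$$p_{X,S}(x, s; \nu) = \prod_{t=1}^{T} \prod_{k=1}^{K} \left[ \pi_k \begin{cases} \mathcal{N}(x_t; 0, \sigma_k^2) & k \in k^{0} \text{ and } \forall x_t \\ \mathcal{G}a(x_t; \alpha_k, \beta_k) & k \in k^{\oplus} \text{ and } x_t \in \mathbb{R}^+ \\ \mathcal{G}a(-x_t; \alpha_k, \beta_k) & k \in k^{\ominus} \text{ and } x_t \in \mathbb{R}^- \\ 0 & \text{otherwise} \end{cases} \right]^{s_{tk}} \qquad (16)$$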



To summarise, in the latent variable representation of the mixture model, the component
for each sample is selected with probability $\pi_k$, reflecting the mixture weight of
component k. The components that are available for a particular datum $x_t$ depend on the
sign of $x_t$. For positive $x_t$, $x_t$ is modelled by a mixture of Gaussian and "positive"
Gamma distributions. For negative $x_t$, $x_t$ is modelled by the same mixture of Gaussian
and a set of "negative" Gamma distributions.
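
The generative reading of the latent indicator representation can be sketched as follows: first draw the indicator, then draw the observation from the selected component. The dictionary layout, the function name and the parameter values are hypothetical. Note that Python's `random.gammavariate` takes a shape and a scale, so the rate is inverted.

```python
import math
import random


def sample_cmm(n, components, rng):
    """Draw n pairs (x_t, k), where k is the index of the selected component."""
    weights = [c["weight"] for c in components]
    draws = []
    for _ in range(n):
        # Latent indicator: component k is selected with probability pi_k.
        k = rng.choices(range(len(components)), weights=weights)[0]
        comp = components[k]
        if comp["kind"] == "zero":
            x = rng.gauss(0.0, math.sqrt(comp["var"]))
        else:
            # random.gammavariate expects (shape, scale), so scale = 1 / rate.
            magnitude = rng.gammavariate(comp["shape"], 1.0 / comp["rate"])
            x = magnitude if comp["kind"] == "pos" else -magnitude
        draws.append((x, k))
    return draws


# Hypothetical 1/1/1 configuration (illustrative values only).
components = [
    {"kind": "neg", "weight": 0.3, "shape": 20.0, "rate": 3.0},
    {"kind": "zero", "weight": 0.4, "var": 1.0},
    {"kind": "pos", "weight": 0.3, "shape": 20.0, "rate": 3.0},
]
draws = sample_cmm(1000, components, random.Random(0))
```

Samples generated by a "positive" component are always positive and those from a "negative" component always negative, whereas the Gaussian components produce values of either sign.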
4.2 Maximum Likelihood Estimation



Estimation of the mixture model can be accomplished by directly maximising the
likelihood of the model given by equation (8). This, however, requires the use of
optimisation methods such as the Newton-Raphson algorithm. Using the complete data mixture
model description instead leads to an optimisation algorithm known as the
Expectation-Maximisation algorithm. The algorithm produces a set of coupled yet analytic update
equations that can be iterated until convergence has been achieved. What is more,
convergence is easily monitored since the convergence criterion is simply one of the
quantities that the algorithm computes anyhow.

The maximum likelihood method of estimating mixture models used here is known
as the Expectation Maximisation (EM) algorithm. The goal of EM is to maximise
the likelihood of the data given the model, i.e. maximise

$$\mathcal{L}(\nu) = \log\{ p_{X,S}(x, s; \nu) \} = \sum_{t=1}^{T} \sum_{k=1}^{K} s_{tk} \log\{ \pi_k \, p_{X_t}(x_t; \theta_k) \}. \qquad (17)$$

If the states of $S = \{S_1, \dots, S_t, \dots, S_T\}$ were known, then the estimation
of the model parameters $\pi$, $\theta$ would be trivial. Conditioned on the state variables and
the observations, equation (17) could be maximised with respect to the model parameters.
However, the values that the state variables take are unknown. This suggests an alternative
two-stage iterated optimisation algorithm: if we knew the expected value of S, we could
use this expectation in the first step to perform a weighted maximum likelihood
estimation of (17) with respect to the model parameters. These estimates will be incorrect
since the expectation of S is inaccurate. So, in the second step, one could update the
expected value of all S while pretending that the model parameters $\pi$ and $\theta$ are known
and held fixed at their values from the previous iteration. This is precisely the strategy of
the Expectation Maximisation (EM) algorithm [1].

The EM algorithm for the CMM iteratively optimises $\mathcal{L}(\nu)$ in two stages [1]:

E-step In this step, the parameters $\nu$ are held fixed at the old values, $\nu^{old}$, obtained from
the previous iteration (or at their initial settings during the algorithm's
initialisation). Conditioned on the observations, the E-step then computes the probability
of the state variables $S_t$, $\forall t$, given the current model parameters and observation
data, i.e.

$$p_{S_t \mid X_t}(s_t \mid x_t, \nu^{old}) \propto p_{X_t \mid S_t}(x_t \mid s_t; \theta^{old}) \, p_{S_t}(s_t; \pi^{old}). \qquad (18)$$

In particular, we compute (and drop the superscript for clarity's sake)

$$p_{S_t \mid X_t}(s_{tk} = 1 \mid x_t, \nu^{old}) = \frac{p_{X_t \mid S_t}(x_t \mid s_{tk} = 1; \theta_k) \, \pi_k}{\sum_{l=1}^{K} p_{X_t \mid S_t}(x_t \mid s_{tl} = 1; \theta_l) \, \pi_l}. \qquad (19)$$

The likelihood terms $p_{X_t \mid S_t}(x_t \mid s_{tk} = 1; \theta_k)$ are evaluated using the observation
densities defined for each of the states. Thus,

$$p_{X_t \mid S_t}(x_t \mid s_{tk} = 1; \theta_k) = \begin{cases} \mathcal{N}(x_t; 0, \sigma_k^2) & k \in k^{0} \text{ and } \forall x_t \\ \mathcal{G}a(x_t; \alpha_k, \beta_k) & k \in k^{\oplus} \text{ and } x_t \in \mathbb{R}^+ \\ \mathcal{G}a(-x_t; \alpha_k, \beta_k) & k \in k^{\ominus} \text{ and } x_t \in \mathbb{R}^- \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$

To simplify the notation we use $\gamma_t$ to symbolise the vector of values computed in (19),
which are the probabilities of each component k being selected for observation
$x_t$. The components of $\gamma_t$ are denoted by $\gamma_{tk}$, i.e.

$$\gamma_{tk} = p_{S_t \mid X_t}(s_{tk} = 1 \mid x_t, \nu^{old}). \qquad (21)$$

Note that, as a consequence of equation (19), $1 = \sum_{k=1}^{K} \gamma_{tk}$.


M-step In this step, the latent state probabilities are considered given and maximisation
is performed with respect to the parameters $\nu$:

$$\nu^{new} = \arg\max_{\nu} \mathcal{L}(\nu). \qquad (22)$$

This results in the following update equations for the parameters of the probability
distributions:

$$\mu_k = \frac{1}{\sum_{t=1}^{T} \gamma_{tk}} \sum_{t=1}^{T} \gamma_{tk} \, x_t, \qquad (23)$$

$$\sigma_k^2 = \frac{1}{\sum_{t=1}^{T} \gamma_{tk}} \sum_{t=1}^{T} \gamma_{tk} \, (x_t - \mu_k)^2. \qquad (24)$$

These two parameters are computed for all states.

For those states that are governed by a Gamma distribution, the shape and rate
parameters are computed using the relations

$$\alpha_k = \frac{\mu_k^2}{\sigma_k^2}, \qquad (25)$$

$$\beta_k = \frac{\mu_k}{\sigma_k^2}. \qquad (26)$$

This approach circumvents the need for approximations or an iterative gradient
descent approach to optimising the shape parameter $\alpha_k$.
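
The two steps above can be sketched as a short, self-contained program. This is an illustration under stated assumptions, not the authors' implementation: the weight update $\pi_k = \frac{1}{T}\sum_t \gamma_{tk}$ is the standard EM update and is not spelled out in the text; the Gaussian means are held at zero; and the Gamma moments are computed on $|x_t|$ so that the mean stays positive for the negative domain. All names and values are hypothetical.

```python
import math
import random


def zero_mean_normal_pdf(x, var):
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gamma_pdf(x, shape, rate):
    if x <= 0.0:
        return 0.0
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1.0) * math.log(x) - rate * x)


def likelihood(x, comp):
    """Observation density of equation (20) for a single component."""
    if comp["kind"] == "zero":
        return zero_mean_normal_pdf(x, comp["var"])
    signed_x = x if comp["kind"] == "pos" else -x
    return gamma_pdf(signed_x, comp["shape"], comp["rate"])


def e_step(xs, comps):
    """Responsibilities gamma_tk of equations (19) and (21), plus the log-likelihood."""
    resp, log_lik = [], 0.0
    for x in xs:
        joint = [c["weight"] * likelihood(x, c) for c in comps]
        total = sum(joint)
        if total <= 0.0:
            raise ValueError("no component explains observation %r" % x)
        resp.append([j / total for j in joint])
        log_lik += math.log(total)
    return resp, log_lik


def m_step(xs, resp, comps):
    """Weighted moments (23)-(24), then the Gamma mapping (25)-(26)."""
    mags = [abs(x) for x in xs]  # assumption: Gamma moments use |x_t|
    new_comps = []
    for k, comp in enumerate(comps):
        n_k = sum(r[k] for r in resp)
        if n_k <= 0.0:
            raise ValueError("component %d received no responsibility" % k)
        weight = n_k / len(xs)  # standard EM weight update (assumption)
        if comp["kind"] == "zero":
            var = sum(r[k] * x * x for r, x in zip(resp, xs)) / n_k
            new_comps.append({"kind": "zero", "weight": weight, "var": var})
        else:
            mu = sum(r[k] * m for r, m in zip(resp, mags)) / n_k
            var = sum(r[k] * (m - mu) ** 2 for r, m in zip(resp, mags)) / n_k
            new_comps.append({"kind": comp["kind"], "weight": weight,
                              "shape": mu * mu / var, "rate": mu / var})
    return new_comps


# Synthetic data from a hypothetical 1/1/1 model, then 30 EM iterations.
rng = random.Random(0)
xs = ([rng.gauss(0.0, 0.5) for _ in range(800)]
      + [rng.gammavariate(20.0, 1.0 / 3.0) for _ in range(600)]
      + [-rng.gammavariate(20.0, 1.0 / 3.0) for _ in range(600)])
comps = [{"kind": "neg", "weight": 1.0 / 3.0, "shape": 5.0, "rate": 1.0},
         {"kind": "zero", "weight": 1.0 / 3.0, "var": 1.0},
         {"kind": "pos", "weight": 1.0 / 3.0, "shape": 5.0, "rate": 1.0}]
for _ in range(30):
    resp, log_lik = e_step(xs, comps)
    comps = m_step(xs, resp, comps)
```

The log-likelihood returned by the E-step is the normaliser of equation (19) accumulated over all observations, so it can be monitored at no extra cost. Because the shape and rate come from moment matching rather than an exact likelihood maximisation, a strict increase at every iteration should not be taken for granted in this sketch.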



5 Results 

Before applying the model to some data it is worth studying the model and the training 
algorithm's behaviour on a simulated data set. 

Figure 3: Histogram of a sample drawn from a constrained mixture model with 2 Gamma
distributions for the positive x-values, 2 Gamma distributions for the negative x-values and 
one Gaussian distribution centred at the origin. 



5.1 Simulated Results 

We generated data from a pre-specified constrained mixture model. In the model, there
were 2 Gamma distributions assigned to the positive domain. These had, respectively,
the shape parameters $\alpha_{p1} = 20$ and $\alpha_{p2} = 10$ and scale parameters $\beta_{p1} = 3$ and
$\beta_{p2} = 4$. Also assigned to the negative domain were 2 Gamma distributions. Respectively,
their shape parameters were $\alpha_{n1} = 20$ and $\alpha_{n2} = 10$ and their scale parameters
$\beta_{n1} = 3$ and $\beta_{n2} = 4$. Finally, a single Gaussian distribution was defined to be centred
at the origin with a variance of $\sigma_0^2 = 1$.

A total of 1000 samples were drawn from the constrained distribution. The empirical
relative counts, i.e. the histogram, are shown in Figure 3. Model calibration was
subsequently repeated for a range of model orders. In particular, the number of kernels
for the negative values ranged from 1 to 3, and similarly for the positive values and the
centred Gaussians. Thus a total of 27 model configurations were evaluated. The penalised
likelihood (BIC [1]) for each model is shown in Figure 4. The configuration with the
minimum penalised likelihood, i.e. the most parsimonious one, was precisely the
configuration from which the data was sampled (2 negative Gamma p.d.f.s, 2 positive Gamma
p.d.f.s, and 1 Gaussian p.d.f.). The resultant estimated constrained model is shown in
Figure 5.

A number of things are noteworthy. The total number of 27 model configurations
implies a large number of computations. This is due to the constrained nature of the model.
These computations are not strictly necessary, in that it is similarly possible to estimate
the total number of mixture components using a standard Gaussian mixture model, which in
this example would imply maximally 9 components. The allocation of kernels to domains
in the constrained mixture can then be determined through visual inspection of the fitted
Gaussian mixture. This approach is approximately statistically correct. The implied
assumption is that each of the Gamma distributions is sufficiently accurately fitted by a
Gaussian distribution.

Figure 4: Penalised likelihood values for 27 configurations of the constrained mixture model
(panel title: Penalised Constrained MM). Each model is denoted by $k^{\ominus}/k^{0}/k^{\oplus}$, a system
of numbers indicating that there are $k^{\ominus}$ Gamma distributions defined for the negative
domain, $k^{\oplus}$ Gamma distributions defined for the positive domain and $k^{0}$ Gaussian
distributions centred at the origin.

Penalising the log-likelihood using BIC, or any other off-the-shelf penalty term, is
theoretically incorrect. This is due to the fact that standard penalty criteria assume that
all model parameters are used to explain the same number of samples, as expressed
by $p \log T$ in the BIC case, p being the number of model parameters and T the sample
size. This condition does not apply in the constrained mixture model case. Gamma
distributions are only used to fit samples that fall within their domain of responsibility.
While it is possible to modify the penalty criteria to match the constrained model, the
standard penalty factors suffice in practice. The standard penalty factors are at worst
overly conservative, i.e. the recommended model order is smaller than one obtained by a
constrained-model matching criterion.
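
A minimal sketch of the standard penalty discussed here is given below. It uses the common form $-2 \log L + p \log T$ of the BIC, for which smaller values are better. The parameter count (K - 1 free weights, two parameters per Gamma component and one variance per zero-mean Gaussian) is an assumption, since the text does not state how p was counted; the function names are hypothetical.

```python
import math


def count_parameters(n_neg, n_zero, n_pos):
    """Assumed count: K - 1 free weights, 2 per Gamma, 1 variance per Gaussian."""
    n_components = n_neg + n_zero + n_pos
    return (n_components - 1) + 2 * (n_neg + n_pos) + n_zero


def bic(log_lik, n_params, n_samples):
    """Standard BIC: -2 log L + p log T (smaller is better)."""
    return -2.0 * log_lik + n_params * math.log(n_samples)


# The 27 configurations obtained with 1 to 3 components per domain.
configs = [(n_neg, n_zero, n_pos)
           for n_neg in (1, 2, 3)
           for n_zero in (1, 2, 3)
           for n_pos in (1, 2, 3)]
penalties = {cfg: count_parameters(*cfg) * math.log(1000) for cfg in configs}
```

In a full model order search each configuration would be fitted by EM and its log-likelihood passed to `bic`; the configuration with the smallest value would then be selected.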



Figure 5: The estimated density for a sample drawn from the 2/1/2-constrained mixture
model described in the text (panel title: Estimated Constrained Mixture Density).

5.2 Asset Returns



We now describe the application of the model (and the model order selection via pe- 
nalised likelihood) to actual financial data. The data is the US Treasury 10-year bond 
price, collected over the period of 5 years on a daily basis - exactly 1513 trading days. 
The asset's returns were calculated as the difference of the day's average price from that 
of the previous day. The sample's histogram is shown in Figure 6.

Figure 6: Histogram of a sample of 1513 ticks obtained from the difference of the average
daily price of the US Treasury 10-year bonds (panel title: Histogram of a Tick Data;
horizontal axis: Return (Arb Units)).

The optimal model order was determined using maximum likelihood estimation
penalised with the BIC criterion. The configuration thus calculated was
1/1/1, i.e. 1 Gamma distribution defined for the negative domain, 1 Gaussian distribution
centred at the origin and 1 Gamma distribution defined for the positive domain.
The resulting mixture model fit is shown in Figure 7 and suggests a good fit.

Figure 7: The estimated density for the difference in average daily price of the US Treasury
10-year bonds, using a 1/1/1-constrained mixture model.

6 Discussion

The constrained mixture model provides a simple statistical decomposition into negative,
positive and near zero domains. The motivation for this model is the accurate
description of price trends as being clearly positive, negative or ranging while accounting
for heavy tails and high kurtosis.

The EM algorithm for the constrained mixture model is only marginally different
from that of standard mixture models. Model order estimation can be performed using
standard likelihood penalisation methods. Even though they theoretically over-penalise, the
study on simulated data has shown that their use does produce acceptable model
complexity estimates.

Issues that remain to be solved are largely identifiability issues. As an example, a
Gaussian distribution at the centre, flanked by two identical Gamma distributions,
provides as good a model as one where the two Gamma distributions are replaced by one or
two Gaussian distributions. While this is of theoretical concern and may imply increased
sensitivity to the initialisation of the model parameters, in practice such precise
symmetry may never arise.

A Standard Probability Distributions

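A.1 The Normal or Gaussian Probability Density

The Normal probability density, denoted by $\mathcal{N}(x; \mu, \sigma^2)$, is given as

$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad (27)$$

where $\mu$ is the mean and the variance is $\sigma^2$.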

A.2 The Gamma Distribution 

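The Gamma probability density, denoted by $\mathcal{G}a(x; \alpha, \beta)$, is given as

$$p_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, x^{\alpha - 1} e^{-\beta x}, \qquad (28)$$

where $\alpha$ is the shape parameter and $\beta$ is the inverse scale (or rate).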